64 research outputs found

    Hierarchical performance monitors for managing the use of shared resources across the topology

    National audience
    The advent of multicore and manycore machines has partly sustained the growth in the performance of computing systems. However, increasing the number of execution units does not necessarily translate into shorter execution times. Among other things, the tasks of an application may interrupt their work to send or wait for messages (synchronizations), or compete for a shared hardware resource within the machine. When modifying the code is not an option (e.g. inside a runtime system), task and data placement can partly overcome these problems, respectively by minimizing communication paths or by balancing the load across the machine. In both cases, a model of the architecture and metrics on its utilization are needed to characterize the (machine, application) pair and compute an efficient placement. In this context, we propose a performance measurement tool that aggregates events collected over time on the nodes of a machine's topology, in order to derive a joint analysis of the program and the machine. The tool is based on an architecture model provided by hwloc and on event-collection plugins (implemented with PAPI and MAQAO), and enables the analysis of parallel applications on a system. It does not perform this analysis directly, but provides the mechanisms to carry it out. A utility displays the topology and the evolution of events, and can generate a trace of the collected events, along with their location, over time. Using a simple application written for the purpose, we show that the results produced by our tool can be used to derive a more efficient thread placement.
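The bottom-up aggregation of events over a topology tree described above can be sketched as follows. This is a toy illustration, not the tool's actual API: the class name, the topology shape, and the counter values are all invented for the example.

```python
# Toy sketch of aggregating per-core event counts up a hardware topology
# tree, in the spirit of the hwloc-based monitor described above.

class TopoNode:
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)
        self.events = 0  # leaf counters would be filled by a collector (e.g. PAPI)

    def aggregate(self):
        """Sum event counts bottom-up so every topology node carries
        the total of its subtree (machine-wide total at the root)."""
        if self.children:
            self.events = sum(c.aggregate() for c in self.children)
        return self.events

# Machine with 2 NUMA nodes, 2 cores each (illustrative topology).
cores = [TopoNode(f"Core#{i}") for i in range(4)]
numa0 = TopoNode("NUMANode#0", cores[:2])
numa1 = TopoNode("NUMANode#1", cores[2:])
machine = TopoNode("Machine", [numa0, numa1])

# Pretend a plugin collected cache-miss counts on each core.
for core, misses in zip(cores, [100, 120, 900, 950]):
    core.events = misses

machine.aggregate()
print(numa0.events, numa1.events, machine.events)  # 220 1850 2070
```

Reading the aggregated values per topology node (here, NUMANode#1 concentrates most misses) is what lets a joint machine/application analysis suggest a better thread placement.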

    A Topology-Aware Performance Monitoring Tool for Shared Resource Management in Multicore Systems

    International audience
    Nowadays, performance optimization involves careful data and task placement to match parallel application needs with the underlying hardware topology. Monitoring the application behavior provides useful information that still needs to be matched with the actual placement, for instance to understand whether bottlenecks are caused by the sequential code itself or by shared resources in parallel programs. We propose an insightful monitoring tool built on two cornerstones of hardware performance counter monitoring and hardware locality modeling, respectively named PAPI and hwloc. It enables a dynamic visual analysis of parallel applications' phases at runtime, revealing their possibly variable and heterogeneous behaviors and needs. A purpose-designed application shows that the topology-aware visual representation of hardware counters can help figure out shared-resource bottlenecks and ease the task placement decision process in runtime systems.
    1 Introduction
    The memory wall makes data locality increasingly important on the road to exascale. Data and computing tasks have to be colocated to better exploit the performance of parallel platforms. Many research projects focus on locality-aware data and/or task placement, for parallel programming models ranging from MPI and OpenMP to graphs of tasks. However, finding out which placement is the best remains a difficult exercise that depends on the topology and characteristics of the hardware and on the application needs. Indeed, the hardware is increasingly complex, and software affinities can be of different kinds. For instance, memory-bound tasks may prefer being scattered all across the machine, while, on the contrary, communication and synchronization may want to keep them close. Runtime systems require help identifying these needs and bottlenecks before they can place tasks accordingly.
    Performance monitoring is a very active software area that offers many tools to gather information about the execution of tasks, the bottlenecks, etc. We introduce, in this paper, a new way to analyze performance by crossing the roads of performance monitoring and topology-aware placement. We propose an extension of the Hardware Locality software (hwloc [2]) that enhances its graphical…
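One way counters matched with topology can reveal shared-resource bottlenecks, as discussed above, is to compare the summed demand of the cores under a shared resource against that resource's capacity. This is a hypothetical check with invented names and numbers, not part of the tool itself:

```python
# Hypothetical contention check: flag a shared resource (e.g. an L3 cache)
# as a likely bottleneck when the summed bandwidth demand of the cores
# below it exceeds the resource's capacity. All numbers are illustrative.

def flag_contended(shared_resources, capacity_gbs):
    """shared_resources: {resource_name: [per-core demand in GB/s]}
    Returns {resource_name: True if demand exceeds capacity}."""
    return {name: sum(demands) > capacity_gbs
            for name, demands in shared_resources.items()}

# Demands measured (hypothetically) on two L3 caches of 25 GB/s each:
demand = {"L3#0": [8.0, 7.5, 9.0, 8.5], "L3#1": [1.0, 0.5, 0.8, 0.6]}
print(flag_contended(demand, capacity_gbs=25.0))
# {'L3#0': True, 'L3#1': False}
```

A runtime system could react to such a flag by migrating some of the threads under L3#0 toward the underused L3#1.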

    VirtualEnaction: A Platform for Systemic Neuroscience Simulation.

    International audience
    Considering the experimental study of systemic models of the brain as a whole (in contrast to models of a single brain area or aspect), there is a real need for tools designed to realistically simulate these models and to experiment with them. We explain here why a robotic setup is not necessarily the best choice, and what the general requirements are for such a benchmarking platform. A step further, we describe an effective solution, freely available online and already in use to validate functional models of the brain. This solution is a digital platform where the brainy-bot implementing the model under study is embedded in a simplified but realistic controlled environment. From visual, tactile and olfactory input, to body, arm and eye motor commands, in addition to vital somesthetic cues, complex survival behaviors can be experimented with. The platform is also complemented with algorithmic high-level cognitive modules, making the job of building biologically plausible bots easier.

    Data and Thread Placement in NUMA Architectures: A Statistical Learning Approach

    International audience
    Nowadays, NUMA architectures are common in compute-intensive systems. Achieving high performance for multi-threaded applications requires both a careful placement of threads on computing units and a thorough allocation of data in memory. Finding such a placement is a hard problem to solve, because performance depends on complex interactions in several layers of the memory hierarchy. In this paper we propose a black-box approach to decide whether an application's execution time can be impacted by the placement of its threads and data, and in such a case, to choose the best placement strategy to adopt. We show that it is possible to reach near-optimal placement policy selection. Furthermore, the solution works across several recent processor architectures, and decisions can be taken with a single run of low-overhead profiling.
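The black-box idea above, profiling an application once and selecting a placement policy learned from past runs, can be sketched with a minimal nearest-neighbor classifier. The feature names, training points, and policy labels are all invented for illustration and are not the paper's actual model:

```python
# Black-box policy selection sketch: profile an application once, then
# pick the placement policy of the most similar previously-measured
# application (1-nearest-neighbor). Features and labels are invented.

import math

# (feature vector, best-known placement policy) pairs from past runs.
# Features = (local-access ratio, cache-miss rate) -- illustrative only.
training = [
    ((0.9, 0.02), "compact"),     # mostly local accesses, cache-friendly
    ((0.4, 0.20), "interleave"),  # memory-bound, spread pages
    ((0.5, 0.05), "scatter"),     # balanced, spread threads
]

def choose_policy(profile):
    """Return the policy of the nearest training point."""
    return min(training, key=lambda t: math.dist(t[0], profile))[1]

# A new application profiled once with low overhead:
print(choose_policy((0.45, 0.18)))  # interleave
```

A real selector would use richer features and a trained statistical model, but the structure, one cheap profiling run mapped to a known-good policy, is the same.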

    Modeling Non-Uniform Memory Access on Large Compute Nodes with the Cache-Aware Roofline Model

    International audience
    NUMA platforms, emerging memory architectures with on-package high-bandwidth memories, bring new opportunities and challenges to bridge the gap between computing power and memory performance. Heterogeneous memory machines feature several performance trade-offs, depending on the kind of memory used and whether it is written or read. Finding memory performance upper bounds subject to such trade-offs aligns with the numerous interests of measuring computing system performance. In particular, representing application performance with respect to the platform performance bounds has been addressed in the state-of-the-art Cache-Aware Roofline Model (CARM) to troubleshoot performance issues. In this paper, we present a Locality-Aware extension (LARM) of the CARM to model NUMA platform bottlenecks, such as contention and remote accesses. On top of this, the new contribution of this paper is the design and validation of a novel hybrid memory bandwidth model, which quantifies the achievable bandwidth upper bound under the above-described trade-offs with less than 3% error. Hence, when comparing application performance with the maximum attainable performance, software designers can now rely on more accurate information.
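The core roofline bound that such locality-aware extensions refine is simple to state: attainable performance is the minimum of the compute roof and the memory roof for the bandwidth actually in play. A minimal sketch, with invented peak and bandwidth numbers (not measurements from the paper):

```python
# Roofline-style bound with NUMA-flavored bandwidth levels: attainable
# performance = min(compute roof, arithmetic intensity * bandwidth).
# The peak and the per-level bandwidths below are illustrative only.

PEAK_GFLOPS = 500.0
BANDWIDTH_GBS = {"local": 80.0, "remote": 30.0}  # local vs. remote NUMA access

def attainable_gflops(arith_intensity, access="local"):
    """Upper bound on performance (GFLOP/s) for a kernel with the given
    arithmetic intensity (flops per byte) and memory access pattern."""
    return min(PEAK_GFLOPS, arith_intensity * BANDWIDTH_GBS[access])

# The same kernel (2 flops/byte) is memory-bound either way, but remote
# accesses lower its roof substantially:
print(attainable_gflops(2.0, "local"))   # 160.0
print(attainable_gflops(2.0, "remote"))  # 60.0
```

This is why a locality-unaware roofline can mislead on NUMA machines: an application pinned far from its data runs against the 30 GB/s roof while the model reports headroom against the 80 GB/s one.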

    Automatic Cache Aware Roofline Model Building and Validation Using

    Proceedings of the Third International Workshop on Sustainable Ultrascale Computing Systems (NESUS 2016), Sofia (Bulgaria), October 6-7, 2016.
    The ever-growing complexity of high performance computing systems imposes significant challenges to exploiting as much as possible of their computational and memory resources. Recently, the Cache-Aware Roofline Model has gained popularity due to its simplicity when modeling multi-cores with complex memory hierarchies, characterizing application bottlenecks, and quantifying achieved or remaining improvements. In this short paper we rely on hardware locality and topology detection to build the Cache-Aware Roofline Model for modern processors in an open-source locality-aware tool. The proposed tool also includes a set of specific micro-benchmarks to assess the micro-architecture performance upper bounds. The experimental results show that, by relying on the proposed tool, it was possible to reach the near-theoretical bounds of an Intel 3770K processor, thus proving the effectiveness of the modeling methodology. We would like to acknowledge Action IC1305 (NESUS) for funding this work.

    Narrowing the Search Space of Applications Mapping on Hierarchical Topologies

    To be held in conjunction with SC21. International audience
    Processor architectures at exascale and beyond are expected to continue to suffer from non-uniform access issues to in-die and node-wide shared resources. Mapping applications onto these resource hierarchies is an ongoing performance concern, requiring specific care for increasing locality and resource sharing, but also for the contention that ensues. Application-agnostic approaches to searching for efficient mappings are based on heuristics. Indeed, the size of the search space makes it impractical to find optimal solutions nowadays, and this will only worsen as the complexity of computing systems increases over time. In this paper we leverage the hierarchical structure of modern compute nodes to reduce the size of this search space. As a result, we facilitate the search for optimal mappings and improve the ability to evaluate existing heuristics. Using widely known benchmarks, we show that permuting thread and process placement within each node of a hierarchical topology leads to similar performance. As a result, the mapping search space can be narrowed down by several orders of magnitude when performing an exhaustive search. This reduced search space will enable the design of new approaches, including exhaustive search or automatic exploration. Moreover, it provides new insights into heuristic-based approaches, including better upper bounds and a smaller solution space.
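The size of the reduction can be illustrated with a small count. The equivalence of within-node permutations is the paper's empirical observation; the combinatorics below, for a hypothetical 8-thread job on a 2-node machine, just show what factoring it out buys:

```python
# Search-space reduction sketch: if permuting threads *within* a node
# does not change performance, only which threads share a node matters.
# The machine shape (2 nodes x 4 cores, 8 threads) is illustrative.

from math import comb, factorial

threads, nodes = 8, 2
cores_per_node = threads // nodes

# All one-thread-per-core placements of 8 distinct threads on 8 cores:
full = factorial(threads)

# Up to within-node permutation (2 nodes): just choose which 4 threads
# land on node 0; the other 4 go to node 1.
reduced = comb(threads, cores_per_node)

print(full, reduced, full // reduced)  # 40320 70 576
```

Even on this toy machine the space shrinks by a factor of (4!)^2 = 576; on deeper hierarchies with more nodes the reduction compounds per level, which is what makes exhaustive search tractable.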

    Modeling Large Compute Nodes with Heterogeneous Memories with Cache-Aware Roofline Model

    International audience
    In order to fulfill the needs of modern applications, computing systems become more powerful, heterogeneous and complex. NUMA platforms and emerging high-bandwidth memories offer new opportunities for performance improvements. However, they also increase hardware and software complexity, thus making application performance analysis and optimization an even harder task. The Cache-Aware Roofline Model (CARM) is an insightful, yet simple model designed to address this issue. It provides feedback on potential application bottlenecks and shows how far the application performance is from the achievable hardware upper bounds. However, it does not encompass NUMA systems and next-generation processors with heterogeneous memories. Yet, some application bottlenecks belong to those memory subsystems and would benefit from the CARM insights. In this paper, we fill the missing requirements to bring recent large shared-memory systems into the scope of the CARM. We provide the methodology to instantiate and validate the model on a NUMA system as well as on the latest Xeon Phi processor equipped with configurable hybrid memory. Finally, we show the model's ability to exhibit several bottlenecks of such systems, which were not covered by the original CARM.

    From biological to numerical experiments in systemic neuroscience: a simulation platform

    International audience
    Studying and modeling the brain as a whole is a real challenge. For such systemic models (in contrast to models of a single brain area or aspect), there is a real need for new tools designed to perform complex numerical experiments, beyond the usual tools distributed in the computer science and neuroscience communities. Here, we describe an effective solution, freely available online and already in use, to validate such models of brain functions. We explain why this is the best choice, as a complement to robotic setups, and what the general requirements are for such a benchmarking platform. In this experimental setup, the brainy-bot implementing the model under study is embedded in a simplified but realistic controlled environment. From visual, tactile and olfactory input, to body, arm and eye motor commands, in addition to vital interoceptive cues, complex survival behaviors can be experimented with. We also discuss algorithmic high-level cognitive modules, making the job of building biologically plausible bots easier. The key point is the ability to alternate between symbolic representations and the complementary, usual neural coding. As a consequence, algorithmic principles have to be considered at a higher abstraction level, beyond a given data representation, which is an interesting challenge.